In this study, I will try to get insights about the esoph data to see if (o)esophageal cancer is related to alcohol consumption, age or tobacco consumption, and will try to get meaning from Young People Survey’s Hobbies & Interests data by using some statistical methods and visualizations. For the data source, you can find the link at the end of the page.
Esoph is R-Data from a case-control study of (o)esophageal cancer in Ille-et-Vilaine, France. Let’s investigate it first!
## agegp alcgp tobgp ncases ncontrols
## 25-34:15 0-39g/day:23 0-9g/day:24 Min. : 0.000 Min. : 1.00
## 35-44:15 40-79 :23 10-19 :24 1st Qu.: 0.000 1st Qu.: 3.00
## 45-54:16 80-119 :21 20-29 :20 Median : 1.000 Median : 6.00
## 55-64:16 120+ :21 30+ :20 Mean : 2.273 Mean :11.08
## 65-74:15 3rd Qu.: 4.000 3rd Qu.:14.00
## 75+ :11 Max. :17.000 Max. :60.00
Our data consists of three categorical variables namely agegp, alcgp two numerical columns, ncases and ncontrols. Number of cases(ncases) ranges from 0 to 17 while number of controls(ncontrols) ranges from 1 to 60.
## 'data.frame': 88 obs. of 5 variables:
## $ agegp : Ord.factor w/ 6 levels "25-34"<"35-44"<..: 1 1 1 1 1 1 1 1 1 1 ...
## $ alcgp : Ord.factor w/ 4 levels "0-39g/day"<"40-79"<..: 1 1 1 1 2 2 2 2 3 3 ...
## $ tobgp : Ord.factor w/ 4 levels "0-9g/day"<"10-19"<..: 1 2 3 4 1 2 3 4 1 2 ...
## $ ncases : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ncontrols: num 40 10 6 5 27 7 4 7 2 1 ...
Here, we can see variable types better. esoph_data has 88 observations of 5 variables.
Below, you can see that the distribution of number of cancer cases are left skewed and 0 cases constitutes 29 of the cancer cases.
a<-ggplot(esoph_data,aes(esoph_data$ncases))+geom_bar(fill="blue",alpha=0.5)+labs(x= 'Number of Cases', y= 'Count')
plotly::ggplotly(a)Below, you can see that percentage of cancer cases tends to be higher when getting older. However, the highest cancer percentage is observed between 65-74 ages and lowest is observed between the 25-34 age group.
b<-esoph_data %>% group_by(agegp)%>%
summarise(total_cases_by_age=sum(ncases),total_control_By_age=sum(ncontrols),perc=total_cases_by_age/total_control_By_age) %>%
ggplot(.,aes(x=agegp, y=perc, fill=agegp)) +
labs(x="Age Group",y="Percentage of Cases",title="Relation between Percentage of Cancer Cases and Age Group")+
geom_bar(stat="identity", position = "dodge")+
theme(axis.text.x = element_text(size=7,vjust=0.4))+
guides(fill=guide_legend(title="Age Groups"))+
theme_minimal()
plotly::ggplotly(b)Below, you can see the distribution of the number of cases by all six age groups. The highest median is found to be 4 in 55-64 age group and maximum number of cases is observed as 17 in 65-74 age group. You can analyze the distribution details by scrolling through the each box blot.
Below, the tobacco consumption of each age group can be seen in six graphs. In the age groups of 45-54 and 55-64, the percentage of cancer cases are significantly higher when tobacco consumption is more than 30 g/day. In the age groups of 25-34 and 35-44, the number and percentage of cancer cases are observed significantly less than other groups.
d<-esoph_data %>% group_by(agegp,tobgp) %>%
summarise(total_cases = sum(ncases), total_controls = sum(ncontrols),percentage=total_cases/total_controls)
d1<-ggplot(d,aes(x=tobgp,y=percentage,fill=tobgp))+
geom_bar(stat="identity")+
facet_wrap(~agegp,scales="fixed")+
guides(fill=guide_legend(title="Tobacco cons"))+
labs(title="Tobacco Consumption vs Percentage of Cancer Cases by Age Group",x="Tobacco Consumption", y= 'Percentage Of Cancer Cases')
plotly::ggplotly(d1)In the age groups of 25-34 and 35-44, contrary to tobacco consumption, when the alcohol consumption is higher than 120g/day, cancer can be observed. Again, in each age group, percentage of cancer cases tends to be proportional with the alcohol consumption per day.
e<-esoph %>% group_by(agegp,alcgp) %>%
summarise(total_cases = sum(ncases), total_controls = sum(ncontrols),percentage=total_cases*100/total_controls)
e1<-ggplot(e,aes(x=alcgp,y=percentage,fill=alcgp))+
geom_bar(stat="identity")+
facet_wrap(~agegp,scales="fixed")+
guides(fill=guide_legend(title="Alcohol cons"))+
labs(title="Alcohol Consumption vs Percentage of Cancer Cases by Age Group",x="Alcohol Consumption", y= 'Percentage Of Cancer Cases')
plotly::ggplotly(e1)As can be seen from the box plots, number of controls tends to be lower in 25-34 and 75+ age group and maximum control number is observed in 35-44 age group. For a detailed information, you can scroll through the graphs.
Below, you can see number of cases over number of controls points are distributed along the lowest left corner of the plot. The outliers in both number of controls and number of cases can be seen better now.
g<-ggplot(esoph_data,aes(x=ncases,y=ncontrols,group=1,color=agegp))+
ggtitle(label="Number of Cases over Number of Controls")+
theme(plot.title = element_text(hjust=1))+
geom_point()+
theme_minimal() +
labs(x = "Number of Cases",y = "Number of Controls", color="Age group") +
theme(axis.text.x = element_text(size=7,vjust=0.4))
plotly::ggplotly(g)## effects of alcohol, tobacco and interaction, age-adjusted
model1 <- glm(cbind(ncases, ncontrols) ~ agegp + tobgp * alcgp,
data = esoph, family = binomial())
anova(model1)## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: cbind(ncases, ncontrols)
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev
## NULL 87 227.241
## agegp 5 88.128 82 139.112
## tobgp 3 19.085 79 120.028
## alcgp 3 66.054 76 53.973
## tobgp:alcgp 9 6.489 67 47.484
## Try a linear effect of alcohol and tobacco
model2 <- glm(cbind(ncases, ncontrols) ~ agegp + unclass(tobgp)
+ unclass(alcgp),
data = esoph, family = binomial())
summary(model2)##
## Call:
## glm(formula = cbind(ncases, ncontrols) ~ agegp + unclass(tobgp) +
## unclass(alcgp), family = binomial(), data = esoph)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7628 -0.6426 -0.2709 0.3043 2.0421
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.01097 0.31224 -12.846 < 2e-16 ***
## agegp.L 2.96113 0.65092 4.549 5.39e-06 ***
## agegp.Q -1.33735 0.58918 -2.270 0.02322 *
## agegp.C 0.15292 0.44792 0.341 0.73281
## agegp^4 0.06668 0.30776 0.217 0.82848
## agegp^5 -0.20288 0.19523 -1.039 0.29872
## unclass(tobgp) 0.26162 0.08198 3.191 0.00142 **
## unclass(alcgp) 0.65308 0.08452 7.727 1.10e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 227.241 on 87 degrees of freedom
## Residual deviance: 59.277 on 80 degrees of freedom
## AIC: 222.76
##
## Number of Fisher Scoring iterations: 6
Our median residuals is close to zero which means our model is not biased in one direction (i.e. the out come is neither over- nor underestimated). The contributions of age group and alcohol group to the deviance is higher than tobgp and both tobgp:alcgp, as can be seen from the first model.
From the second generalized linear model, we can observe that agegp.L, tobgp and alcgp has a positive effect, whereas agegp.Q and agegp^5 has negative effect. tobgp, alcgp, agegp.L and agegp.Q are statistically significant. Null deviance is relatively higher than residual deviance, which is a good sign! This means using more than a single parameter explains the model better.
First let’s read the data.
Now, we get the related columns that we will work on and clean the data.
Now, our data is ready for some interesting insights! First, let’s see if we can reduce the dimensionality of this large data set while keeping as much information as possible. For that, we will use Principal Component Analysis(PCA).
## History Psychology Politics Mathematics Physics
## History 1.000000000 0.2963735855 0.40623429 0.008786356 0.07088193
## Psychology 0.296373585 1.0000000000 0.19108731 0.043974943 0.07142061
## Politics 0.406234287 0.1910873113 1.00000000 0.103492328 0.13388052
## Mathematics 0.008786356 0.0439749434 0.10349233 1.000000000 0.60785444
## Physics 0.070881929 0.0714206142 0.13388052 0.607854436 1.00000000
## Internet 0.006612009 -0.0003077005 0.04428471 0.156686522 0.10666200
## Internet
## History 0.0066120093
## Psychology -0.0003077005
## Politics 0.0442847125
## Mathematics 0.1566865225
## Physics 0.1066619993
## Internet 1.0000000000
Here, we see that mathematics and physics are positively correlated, unsurprisingly!
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 2.0433958 1.8069970 1.60438109 1.46391156 1.26463164
## Proportion of Variance 0.1304833 0.1020387 0.08043871 0.06696991 0.04997791
## Cumulative Proportion 0.1304833 0.2325220 0.31296072 0.37993063 0.42990854
## Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
## Standard deviation 1.18461175 1.06955650 1.05602251 1.04363010 1.00633062
## Proportion of Variance 0.04385328 0.03574847 0.03484949 0.03403637 0.03164692
## Cumulative Proportion 0.47376182 0.50951029 0.54435978 0.57839615 0.61004306
## Comp.11 Comp.12 Comp.13 Comp.14 Comp.15
## Standard deviation 0.95951254 0.93789360 0.93168719 0.89503805 0.87262765
## Proportion of Variance 0.02877076 0.02748889 0.02712628 0.02503416 0.02379622
## Cumulative Proportion 0.63881382 0.66630271 0.69342899 0.71846315 0.74225937
## Comp.16 Comp.17 Comp.18 Comp.19 Comp.20
## Standard deviation 0.85510068 0.8329419 0.80296355 0.78632469 0.76747805
## Proportion of Variance 0.02284991 0.0216810 0.02014845 0.01932208 0.01840695
## Cumulative Proportion 0.76510928 0.7867903 0.80693874 0.82626082 0.84466777
## Comp.21 Comp.22 Comp.23 Comp.24 Comp.25
## Standard deviation 0.74821762 0.72513227 0.71054805 0.70905599 0.69087699
## Proportion of Variance 0.01749468 0.01643178 0.01577745 0.01571126 0.01491597
## Cumulative Proportion 0.86216245 0.87859422 0.89437168 0.91008294 0.92499891
## Comp.26 Comp.27 Comp.28 Comp.29 Comp.30
## Standard deviation 0.65750430 0.64296368 0.60888823 0.5954011 0.551814415
## Proportion of Variance 0.01350975 0.01291882 0.01158578 0.0110782 0.009515598
## Cumulative Proportion 0.93850866 0.95142748 0.96301326 0.9740915 0.983607058
## Comp.31 Comp.32
## Standard deviation 0.539017589 0.483770803
## Proportion of Variance 0.009079374 0.007313568
## Cumulative Proportion 0.992686432 1.000000000
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## History 0.196 0.184 0.243 0.130 0.243 0.142 0.194
## Psychology 0.244 0.119 0.112 -0.102
## Politics 0.131 0.202 0.255 0.204 0.270 0.153 -0.108
## Mathematics 0.295 -0.126 0.133 -0.309 -0.220 -0.273
## Physics 0.344 -0.241 0.136 -0.155 -0.151 -0.151
## Internet 0.237 0.127 -0.400 0.370
## PC 0.377 -0.134 -0.291 0.137 0.144
## Economy.Management 0.174 0.306 0.214 -0.182 -0.166 -0.288
## Biology 0.269 -0.377 -0.151 0.221
## Chemistry 0.199 -0.421 0.230
## Reading 0.282 -0.188 0.176 0.173
## Geography 0.160 0.145 0.151 0.218 0.226 0.243
## Foreign.languages 0.208 0.217 -0.117 -0.165 0.227
## Medicine 0.269 -0.317 -0.118 0.266
## Law 0.143 0.147 0.277 0.363 0.116 -0.142
## Cars 0.350 -0.168
## Art.exhibitions 0.313 -0.201
## Religion 0.230 0.166 0.128
## Countryside..outdoors 0.194 -0.102 -0.334 0.167 0.119
## Dancing 0.241 -0.230 -0.107 -0.326
## Musical.instruments 0.210 -0.361 -0.158
## Writing 0.231 0.161 -0.194 -0.114 0.214 -0.171
## Passive.sport 0.162 -0.222 -0.135 0.116
## Active.sport 0.200 -0.241 -0.138 0.340 -0.286
## Gardening 0.192 -0.178 0.445 -0.200
## Celebrities 0.183 -0.351 0.181 -0.289 0.234
## Shopping -0.136 0.178 -0.399 0.153 -0.241
## Science.and.technology 0.348 0.195
## Theatre 0.302 -0.128 -0.144 -0.256 0.115
## Fun.with.friends 0.135 -0.260 -0.154 -0.498 0.208
## Adrenaline.sports 0.254 -0.221 -0.200 0.313 -0.179
## Pets -0.256 0.138 0.132
## Comp.9 Comp.10 Comp.11 Comp.12 Comp.13 Comp.14 Comp.15
## History 0.162 0.252 0.233
## Psychology 0.214 -0.181 0.232 0.272 0.536 0.153 -0.266
## Politics 0.139 -0.187 0.146
## Mathematics 0.360 -0.129 0.147
## Physics 0.127 0.160 0.130
## Internet -0.107 0.147 0.264 0.174 0.157
## PC -0.183 0.100
## Economy.Management -0.163 0.189 -0.190 -0.148
## Biology
## Chemistry -0.142 0.101
## Reading 0.116 -0.130 -0.150 0.177
## Geography -0.335 0.105 -0.349 -0.228 0.113 0.155
## Foreign.languages -0.364 -0.122 -0.187 0.114 -0.408
## Medicine -0.135 -0.178 -0.101
## Law 0.135 -0.179 -0.158
## Cars 0.101 0.125 -0.331 -0.100 -0.174
## Art.exhibitions 0.166 -0.156 0.114 -0.343
## Religion -0.220 0.370 0.456 -0.146 0.147 0.103
## Countryside..outdoors -0.240 0.371 0.412
## Dancing -0.204 -0.108 0.114 0.102 0.145 0.200
## Musical.instruments -0.106 -0.226 0.270 -0.141 -0.110 -0.428 0.197
## Writing 0.212 -0.319 -0.309 0.149
## Passive.sport -0.150 0.259 0.560 -0.466 0.125 -0.296
## Active.sport -0.204 -0.214 0.153 0.175
## Gardening 0.108 0.220 -0.109 0.146
## Celebrities 0.185
## Shopping -0.108 0.105
## Science.and.technology 0.194 -0.169 -0.186 0.121 -0.194 -0.182
## Theatre 0.190 0.145 -0.258 -0.252
## Fun.with.friends 0.173 -0.147 0.130 0.471
## Adrenaline.sports 0.103 -0.120 0.141
## Pets 0.467 0.336 0.108 0.306 -0.465 0.209
## Comp.16 Comp.17 Comp.18 Comp.19 Comp.20 Comp.21 Comp.22
## History 0.141 0.130 0.369
## Psychology -0.125 0.116 0.242 0.163
## Politics 0.149 0.140 -0.154
## Mathematics 0.183 -0.171 0.169
## Physics 0.150 -0.125 0.126 0.162
## Internet 0.195 -0.216 -0.225 -0.113
## PC 0.226 -0.103 -0.249
## Economy.Management -0.333 -0.159 -0.224
## Biology -0.141
## Chemistry -0.102
## Reading -0.411 -0.271 0.131
## Geography -0.236 0.302 -0.223 0.285
## Foreign.languages -0.329 0.154 0.303 0.202 0.111
## Medicine -0.165
## Law 0.187 -0.301 -0.262 0.183
## Cars 0.164 -0.145 0.380 0.464
## Art.exhibitions 0.122 -0.313 0.286 -0.199
## Religion 0.125 -0.435 -0.322 -0.259
## Countryside..outdoors 0.192 0.189 -0.397 0.125
## Dancing 0.345 0.202 0.211 -0.311 0.108 0.310
## Musical.instruments 0.200 0.303 0.213 -0.230
## Writing -0.169 -0.250 -0.224 -0.178 0.358
## Passive.sport -0.128 0.274
## Active.sport 0.220 0.172 -0.432 0.108 -0.301
## Gardening -0.564 -0.193 0.189 0.236
## Celebrities 0.248 -0.252 0.131 0.254 -0.187 -0.201
## Shopping 0.101 -0.178 0.282 0.122 0.118
## Science.and.technology -0.189 0.554 -0.102 -0.148
## Theatre 0.185 -0.147 -0.153
## Fun.with.friends -0.304 -0.160 0.126
## Adrenaline.sports -0.111 -0.365 -0.278 -0.278
## Pets 0.240 -0.263 0.152
## Comp.23 Comp.24 Comp.25 Comp.26 Comp.27 Comp.28 Comp.29
## History 0.129 0.487 0.247 0.164
## Psychology -0.228 -0.162 -0.152 -0.144 -0.131
## Politics 0.119 -0.337 0.355 0.335 -0.385 0.120
## Mathematics -0.195
## Physics -0.172 -0.129 0.263
## Internet 0.219 -0.206 0.359 -0.223
## PC 0.189 -0.442 0.499
## Economy.Management 0.180 0.443 -0.235 -0.101
## Biology 0.102
## Chemistry 0.221 0.135 0.165
## Reading 0.303 -0.144 -0.245 -0.303 -0.348 -0.174
## Geography -0.175 -0.272
## Foreign.languages -0.174 0.149 0.178 0.190 0.137 0.118
## Medicine -0.110 0.155
## Law -0.438 -0.128 -0.253 0.319
## Cars 0.262 -0.220 -0.261
## Art.exhibitions -0.126 0.242 0.332 0.122 -0.262 -0.311
## Religion 0.142
## Countryside..outdoors -0.135 -0.142 0.280 -0.132
## Dancing 0.274 -0.102 0.142 0.260
## Musical.instruments 0.103 -0.257 -0.171 -0.179
## Writing -0.110 0.165 -0.197 0.289 0.140
## Passive.sport 0.112 -0.129
## Active.sport -0.258 -0.139 0.125 -0.204 0.126
## Gardening -0.112 0.157 -0.193 0.135
## Celebrities 0.318 -0.237 -0.228 0.169 0.194
## Shopping -0.342 0.322 0.255 -0.186 -0.295 -0.215 0.163
## Science.and.technology -0.107 0.175 -0.435
## Theatre -0.202 0.486 0.372
## Fun.with.friends -0.126 -0.305 0.136
## Adrenaline.sports 0.290 0.419 -0.245
## Pets
## Comp.30 Comp.31 Comp.32
## History 0.179
## Psychology -0.195
## Politics 0.113
## Mathematics 0.575
## Physics 0.123 -0.650
## Internet
## PC 0.162
## Economy.Management -0.239
## Biology -0.792
## Chemistry -0.623 0.390
## Reading -0.155
## Geography
## Foreign.languages
## Medicine 0.632 0.397
## Law
## Cars
## Art.exhibitions -0.102
## Religion -0.111
## Countryside..outdoors
## Dancing
## Musical.instruments
## Writing 0.150
## Passive.sport
## Active.sport
## Gardening
## Celebrities
## Shopping 0.134
## Science.and.technology
## Theatre -0.175 0.161
## Fun.with.friends
## Adrenaline.sports
## Pets
From the analysis, it can be observed that first 8 component explains more than %50 of the total variance. Reading affects first component positively while second component is affected by reading negatively.
More than %80 of the variance is explained by only 20 principal components. Here, you can see the cumulative variance by principal components.
In this part, in order to provide a good visual representation, we will use multidimensional scaling technique. The key here is the usage of correlation matrix as similarity measure.
#Set the seed for reproducibility
set.seed(120)
# Here 1 makes the distance values positive!
yr_dist<- 1 - cor(filtered_survey)
yr_mds <- cmdscale(yr_dist,k=2)
#Provide column names
colnames(yr_mds) <- c("x","y")
print(yr_mds)## x y
## History 0.06372563 0.012956654
## Psychology 0.23826552 -0.024422880
## Politics -0.14749496 0.107881476
## Mathematics -0.32345626 -0.297702691
## Physics -0.32464296 -0.499240678
## Internet -0.43673303 0.187118496
## PC -0.58516404 -0.107666761
## Economy.Management -0.30422083 0.383264449
## Biology 0.25483318 -0.395803325
## Chemistry 0.12745384 -0.486209735
## Reading 0.54412403 -0.002498553
## Geography -0.04741594 0.042633809
## Foreign.languages 0.20488850 0.223464512
## Medicine 0.21225026 -0.352629731
## Law -0.07751035 0.252871501
## Cars -0.58387376 0.046912232
## Art.exhibitions 0.37143545 0.009747048
## Religion 0.19597972 -0.174122589
## Countryside..outdoors 0.14091867 -0.071530114
## Dancing 0.27743365 0.136486519
## Musical.instruments 0.16521053 -0.134880364
## Writing 0.30297822 -0.032446839
## Passive.sport -0.30096850 0.108795592
## Active.sport -0.24240576 0.047405197
## Gardening 0.19439442 -0.077481786
## Celebrities 0.06494596 0.467218540
## Shopping 0.20807212 0.463704403
## Science.and.technology -0.36445481 -0.278443856
## Theatre 0.44547786 0.054248884
## Fun.with.friends -0.05316373 0.253721685
## Adrenaline.sports -0.30291870 0.041430796
## Pets 0.08203608 0.095218109
Now, we have a better insight about the hobbies and interests of young people! As we figured out before, physics has close relationship with mathematics,although not as close as Science.and.technology and mathematics! Chemistry, medicine and biology also are three very close interests. Unsurprisingly again, writing, art.exhibitions and theater have close relationship with each other.